A Scalable Approach to Building a Parallel Corpus from the Web

Authors

  • Vivek Kumar Rangarajan Sridhar
  • Luciano Barbosa
  • Srinivas Bangalore
Abstract

Parallel text acquisition from the Web is an attractive way to augment statistical models (e.g., machine translation, cross-lingual document retrieval) with domain-representative data. The basis for obtaining such data is a collection of pairs of bilingual Web sites or pages. In this work, we propose a crawling strategy that locates bilingual Web sites by constraining the visitation policy of the crawler to the graph neighborhood of bilingual sites on the Web. Subsequently, we use a novel mining technique that recursively extracts text and links from the collection of bilingual Web sites obtained from the crawl. Our method avoids the computationally prohibitive combinatorial matching typical of previous work, which uses document retrieval techniques to match a collection of bilingual webpages. We demonstrate the efficacy of our approach in the context of machine translation in the tourism and hospitality domain. The parallel text obtained using our crawling strategy yields a relative improvement of 21% in BLEU score (English-to-Spanish) over an out-of-domain seed translation model trained on the European parliamentary proceedings.
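The neighborhood-constrained visitation policy described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `crawl_bilingual_neighborhood`, the in-memory `link_graph`, the `is_bilingual` predicate, and the example host names are all hypothetical, and the rule of following links only out of sites already judged bilingual is an assumed simplification of "graph neighborhood".

```python
from collections import deque


def crawl_bilingual_neighborhood(link_graph, seed_sites, is_bilingual):
    """Visit sites breadth-first, following links only out of bilingual sites.

    link_graph   -- dict mapping a site to the sites it links to (hypothetical
                    stand-in for fetching a page and extracting its links)
    seed_sites   -- iterable of starting sites
    is_bilingual -- callable returning True for a bilingual site (assumed to
                    be supplied by some external classifier)
    """
    found = []
    visited = set(seed_sites)
    queue = deque(seed_sites)
    while queue:
        site = queue.popleft()
        if not is_bilingual(site):
            # Monolingual sites are not expanded, so the crawl never leaves
            # the direct neighborhood of known bilingual sites.
            continue
        found.append(site)
        for neighbor in link_graph.get(site, ()):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return found


# Toy link graph with made-up host names.
toy_graph = {
    "hotel-a.example": ["hotel-b.example", "news.example"],
    "hotel-b.example": ["hotel-c.example"],
    "news.example": ["hotel-d.example"],
}
toy_bilingual = {
    "hotel-a.example",
    "hotel-b.example",
    "hotel-c.example",
    "hotel-d.example",
}

result = crawl_bilingual_neighborhood(
    toy_graph, ["hotel-a.example"], toy_bilingual.__contains__
)
```

In this toy graph, `hotel-d.example` is bilingual but is never visited, because it is reachable only through a monolingual site. That trade-off (missing some bilingual sites in exchange for a much smaller crawl frontier) is the general idea behind restricting the crawler to the neighborhood of bilingual sites.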

Similar resources

Dynamic configuration and collaborative scheduling in supply chains based on scalable multi-agent architecture

Due to diversified and frequently changing demands from customers, technological advances and global competition, manufacturers rely on collaboration with their business partners to share costs, risks and expertise. How to take advantage of advancement of technologies to effectively support operations and create competitive advantage is critical for manufacturers to survive. To respond to these...

A Tandem Scalable Microwave-Assisted Williamson Alkyl Aryl Ether Synthesis under Mild Conditions

An efficient tandem synthesis of alkyl aryl ethers, including valuable building blocks of dialdehyde and dinitro groups under microwave irradiation and solvent free conditions on potassium carbonate as a mild solid base has been developed. A series of alkyl aryl ethers were obtained from alcohols in excellent yields by following the Williamson ether synthesis protocol under practical mild condi...

WeBiText: Building Large Heterogeneous Translation Memories from Parallel Web Content

This paper investigates the extent to which a useful general purpose Translation Memory (TM) can be built based on very large amounts of heterogeneous parallel texts mined from the Web. In particular, we evaluate whether such a TM could add value over TMs built from other large, publicly available parallel corpora, such as the Canadian Hansard. In the case of Canadian translators working with E...

CLIR using a Probabilistic Translation Model based on Web Documents

In this report, we describe the approach we used in TREC-8 Cross-Language IR (CLIR) track. The approach is based on probabilistic translation models estimated from two parallel training corpora: one established manually, and the other built automatically with the documents mined from the Web. We describe the principle of model building, the mining of parallel texts, as well as some preliminary ...

Comparing k-means clusters on parallel Persian-English corpus

This paper compares clusters of aligned Persian and English texts obtained with the k-means method. Text clustering has many applications in various fields of natural language processing. So far, much research on clustering English documents has been carried out. The question now arises: can those results be extended to other languages? Since the goal of document clustering is grouping of docum...

Publication date: 2011